Trustworthiness of LLM outputs is an important consideration for safe deployment of LLMs in production. One straightforward way to address this is to measure the quality of selected LLM outputs. Reinforcement Learning from Human Feedback (RLHF) is one of the widely used methods for obtaining better quality-calibrated models.
Since human feedback data is expensive to obtain, the paper explores the use of token-level self-evaluation to improve the accuracy and quality of content generated by large language models. Experimental results show that self-evaluation based scores are effective for selective generation, and that leveraging the self-evaluation ability of LLMs improves quality calibration.
The paper proposes methods to convert open-ended generation into token-level evaluation tasks that the LLM can self-evaluate, such as multiple-choice question answering or true/false evaluation. Two main methods are proposed: Sample & Select (multiple-choice) and Sample & Eval (true/false); a sketch of the multiple-choice variant is given below.
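To make the Sample & Select idea concrete, here is a minimal sketch of how the multiple-choice reformulation could look. It assumes a hypothetical `llm` client exposing `sample(prompt, n)` for candidate generation and `option_logprobs(prompt, options)` for per-option token log-probabilities; these method names are illustrative, not the paper's code or any real API.

```python
import math

OPTION_LETTERS = ["A", "B", "C", "D", "E"]

def sample_and_select(llm, question, num_candidates=4):
    # 1. Sample several candidate answers for the open-ended question.
    candidates = llm.sample(question, n=num_candidates)

    # 2. Reformat generation as a multiple-choice self-evaluation task,
    #    including a "None of the above" escape option.
    options = candidates + ["None of the above"]
    prompt = f"Question: {question}\n"
    for letter, option in zip(OPTION_LETTERS, options):
        prompt += f"({letter}) {option}\n"
    prompt += "The best answer is:"

    # 3. Ask the same LLM to score each option letter; the probability of the
    #    chosen letter serves as the self-evaluation confidence score.
    logprobs = llm.option_logprobs(prompt, OPTION_LETTERS[: len(options)])
    probs = [math.exp(lp) for lp in logprobs]
    best_idx = max(range(len(options)), key=lambda i: probs[i])
    return options[best_idx], probs[best_idx]
```

Sample & Eval works analogously, except each candidate is scored independently with a true/false prompt rather than being ranked jointly against the other samples.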
Experiments on the TruthfulQA and TL;DR datasets show that the self-evaluation scores significantly improve calibration for selective generation compared to sequence-likelihood scores. The hybrid method with a "None of the Above" option performs best overall on accuracy, calibration-AUC, and selective-AUC metrics. Self-evaluation thus provides a way to improve the calibration of LLMs for selective text generation without needing extra training data.
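As a rough illustration of what a selective-AUC-style metric measures (this is not the paper's evaluation code), a confidence score can drive selective generation by answering only the most confident examples; the area under the resulting accuracy-vs-coverage curve is higher when the score ranks correct outputs ahead of incorrect ones.

```python
import numpy as np

def accuracy_coverage_curve(scores, correct):
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(-scores)  # answer the most confident examples first
    sorted_correct = correct[order]
    coverage = np.arange(1, len(scores) + 1) / len(scores)
    selective_accuracy = np.cumsum(sorted_correct) / np.arange(1, len(scores) + 1)
    return coverage, selective_accuracy

# Toy example: self-evaluation confidence per example and whether the
# corresponding generation was correct.
scores = [0.92, 0.35, 0.78, 0.51]
correct = [1, 0, 1, 0]
cov, acc = accuracy_coverage_curve(scores, correct)
# Trapezoidal area under the accuracy-coverage curve (a selective-AUC proxy).
selective_auc = float(np.sum((acc[1:] + acc[:-1]) / 2 * np.diff(cov)))
print(cov, acc, selective_auc)
```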
One pitfall I can see in this paper is that all experiments are carried out with PaLM-2 and GPT-3 rather than GPT-3.5 or GPT-4, since the OpenAI API does not provide output log-probabilities for those models. It will be interesting to see how self-evaluation holds up with GPT-3.5+ models.